{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "7ebcb8cf",
   "metadata": {
    "lines_to_next_cell": 0
   },
   "source": [
    "#  22\\.  异质集成投票方法应用  # \n",
    "\n",
    "##  22.1.  介绍  # \n",
    "\n",
    "前面集成学习算法实验中，我们重点介绍了 Bagging 和 Boosting 两类，其中 Bagging 主要是应用了投票法。但无论是 Bagging Tree 还是随机森林，都采用了决策树算法进行同质集成。本次挑战中，我们将学会应用不同算法进行异质集成学习。 \n",
    "\n",
    "##  22.2.  知识点  # \n",
    "\n",
    "  * CART 决策树分类 \n",
    "\n",
    "  * 网格搜索参数选择 \n",
    "\n",
    "本次挑战我们依旧沿用集成学习实验中的学生成绩数据集。我们先加载数据集并完成训练和测试集切分。 "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "cbece66d",
   "metadata": {
    "lines_to_next_cell": 0
   },
   "outputs": [],
   "source": [
    "wget -nc https://cdn.aibydoing.com/aibydoing/files/course-14-student.csv"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "505f9610",
   "metadata": {
    "lines_to_next_cell": 0
   },
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "from sklearn.model_selection import train_test_split\n",
    "\n",
    "stu_data = pd.read_csv(\"course-14-student.csv\", index_col=0)\n",
    "\n",
    "X_train, X_test, y_train, y_test = train_test_split(\n",
    "    stu_data.iloc[:, :-1], stu_data[\"G3\"], test_size=0.3, random_state=35\n",
    ")\n",
    "\n",
    "X_train.shape, X_test.shape, y_train.shape, y_test.shape"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "bdf4979d",
   "metadata": {},
   "outputs": [],
   "source": [
    "((276, 26), (119, 26), (276,), (119,))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "72a21983",
   "metadata": {
    "lines_to_next_cell": 0
   },
   "source": [
    "##  22.3.  投票分类器  # \n",
    "\n",
    "下面，我们介绍 scikit-learn 提供的 ` VotingClassifier  ` 投票分类器 [ 官方文档 ](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.VotingClassifier.html) 。 "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "38f1c67f",
   "metadata": {},
   "outputs": [],
   "source": [
    "sklearn.ensemble.VotingClassifier(estimators, voting='hard')"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "293ca669",
   "metadata": {
    "lines_to_next_cell": 0
   },
   "source": [
    "其中： \n",
    "\n",
    "  * ` estimators  ` ：可以通过列表套元组的方式 ` [('name1',  model1),  ('name2',  model2)]  ` 传入多个不同的分类器。 \n",
    "\n",
    "  * ` voting  ` ：可选 ` hard  ` 或 ` soft  ` 。 \n",
    "\n",
    "当 ` voting='hard'  ` 时，相当于前面说过的多数投票法。例如对于某样本判定： \n",
    "\n",
    "  * 分类器 1 → 类别 1 \n",
    "\n",
    "  * 分类器 2 → 类别 1 \n",
    "\n",
    "  * 分类器 3 → 类别 2 \n",
    "\n",
    "最终预测该样本属于类别 1。 \n",
    "\n",
    "当 ` voting='soft'  ` 时，相当于前面说过的加权投票法。例如对于某样本判定，我们预先设定 3 个类别的权重为  $w_1=1$  ,  $w_2=1$  ,  $w_3=1$  ，那么根据分类器返回的类别概率，就可以得到最终在 3 个类别上的平均概率，示例计算表格如下： \n",
    "\n",
    "分类器  |  类别 1  |  类别 2  |  类别 3   \n",
    "---|---|---|---  \n",
    "分类器 1  |  w1 * 0.2  |  w1 * 0.5  |  w1 * 0.3   \n",
    "分类器 2  |  w2 * 0.6  |  w2 * 0.3  |  w2 * 0.1   \n",
    "分类器 3  |  w3 * 0.3  |  w3 * 0.4  |  w3 * 0.3   \n",
    "平均概率  |  0.37  |  0.4  |  0.23   \n",
    "  \n",
    "表格中，以分类器 1 为例，分类器返回的 3 个类别概率为 ` 0.2,  0.5,  0.3  ` ，需要分别乘以预设权重 ` w1,  w2,  w3  ` 。最终，因为类别 2 平均加权概率最大，所以判定样本属于类别 2。 \n",
    "\n",
    "Exercise 22.1 \n",
    "\n",
    "挑战：学习并使用 VotingClassifier 完成异质集成投票分类。 \n",
    "\n",
    "规定：比较逻辑回归，决策树，朴素贝叶斯高斯方法等 3 个「个体分类器」与其组成的 VotingClassifier 测试结果，可以自由选定参数。 "
   ]
  },
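  {
   "cell_type": "markdown",
   "id": "a1e5c3f2",
   "metadata": {
    "lines_to_next_cell": 0
   },
   "source": [
    "Before coding the exercise, here is a minimal NumPy sketch of the soft-voting arithmetic in the table above. The probabilities and weights are the example values from the table, not real model outputs:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "b2f6d4a3",
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "\n",
    "# Class probabilities returned by each classifier (rows) for 3 classes (columns)\n",
    "probas = np.array([\n",
    "    [0.2, 0.5, 0.3],  # classifier 1\n",
    "    [0.6, 0.3, 0.1],  # classifier 2\n",
    "    [0.3, 0.4, 0.3],  # classifier 3\n",
    "])\n",
    "weights = np.array([1, 1, 1])  # per-classifier weights w1, w2, w3\n",
    "\n",
    "# Soft voting: weighted average over classifiers, then take the argmax\n",
    "avg = np.average(probas, axis=0, weights=weights)\n",
    "print(avg.round(2))  # -> [0.37 0.4  0.23]\n",
    "print(\"predicted class:\", avg.argmax() + 1)  # -> predicted class: 2"
   ]
  },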
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "b98da13b",
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.linear_model import LogisticRegression\n",
    "from sklearn.naive_bayes import GaussianNB\n",
    "from sklearn.tree import DecisionTreeClassifier\n",
    "from sklearn.ensemble import VotingClassifier\n",
    "\n",
    "## 代码开始 ### (>5 行代码)\n",
    "\n",
    "## 代码结束 ###"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b904da88",
   "metadata": {
    "lines_to_next_cell": 0
   },
   "source": [
    "参考答案  Exercise 22.1 "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "999cae23",
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.linear_model import LogisticRegression\n",
    "from sklearn.naive_bayes import GaussianNB\n",
    "from sklearn.tree import DecisionTreeClassifier\n",
    "from sklearn.ensemble import VotingClassifier\n",
    "\n",
    "# 代码开始 ### (>5 行代码)\n",
    "clf1 = LogisticRegression(\n",
    "    solver='lbfgs', multi_class='auto', max_iter=1000, random_state=1)\n",
    "clf2 = DecisionTreeClassifier(random_state=1)\n",
    "clf3 = GaussianNB()\n",
    "\n",
    "eclf = VotingClassifier(\n",
    "    estimators=[('lr', clf1), ('dt', clf2), ('gnb', clf3)], voting='hard')\n",
    "\n",
    "for clf, label in zip([clf1, clf2, clf3, eclf],\n",
    "                      ['LogisticRegression:', 'DecisionTreeClassifier:',\n",
    "                       'GaussianNB:', 'VotingClassifier:']):\n",
    "    clf.fit(X_train, y_train)\n",
    "    scores = clf.score(X_test, y_test)\n",
    "    print(label, round(scores, 2))\n",
    "### 代码结束 ###"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "fa27ace7",
   "metadata": {
    "lines_to_next_cell": 0
   },
   "source": [
    "最终期望得到 3 个分类器与 VotingClassifier 在测试集上的分类准确率。 \n",
    "\n",
    "期望输出 "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "7969d1f5",
   "metadata": {},
   "outputs": [],
   "source": [
    "LogisticRegression: 0.76\n",
    "DecisionTreeClassifier: 0.87\n",
    "GaussianNB: 0.53\n",
    "VotingClassifier: 0.78"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "7addd15d",
   "metadata": {},
   "source": [
    "上面给出的结果仅供参考，数值随着参数可能发生变化。 \n",
    "\n",
    "你可能在完成挑战的过程中会发现 ` VotingClassifier  ` 的结果不一定优于全部的个体分类器，例如上面参考输出给出的结果一样。实际上这是正常现象，个体学习器往往会对训练数据过拟合，而 ` VotingClassifier  ` 因为采用多数投票可以很好避免。除此之外 ` voting='soft'  ` 可能会比 ` voting='hard'  ` 的结果更为糟糕，原因是不一定能准确把握类别权重的设置。 \n",
    "\n",
    "所以， ` VotingClassifier  ` 不一定是在单数据集上表现最好的分类器，但应该是能够较好避免过拟合的分类器。 "
   ]
  }
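,
  {
   "cell_type": "markdown",
   "id": "c7e8f1a4",
   "metadata": {
    "lines_to_next_cell": 0
   },
   "source": [
    "As an optional sketch beyond the challenge, the same kind of ensemble can be switched to soft voting with per-classifier `weights`. This self-contained example uses a synthetic 3-class dataset built with `make_classification` (so it runs without the CSV file), and the weights `[2, 1, 1]` are arbitrary illustrative values, not tuned choices:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "d8f9a2b5",
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.datasets import make_classification\n",
    "from sklearn.ensemble import VotingClassifier\n",
    "from sklearn.linear_model import LogisticRegression\n",
    "from sklearn.model_selection import train_test_split\n",
    "from sklearn.naive_bayes import GaussianNB\n",
    "from sklearn.tree import DecisionTreeClassifier\n",
    "\n",
    "# Synthetic 3-class data for illustration; you can swap in X_train/y_train from above\n",
    "X, y = make_classification(n_samples=500, n_classes=3, n_informative=5,\n",
    "                           random_state=35)\n",
    "Xa, Xb, ya, yb = train_test_split(X, y, test_size=0.3, random_state=35)\n",
    "\n",
    "# Soft voting averages the classifiers' predict_proba outputs, weighted per classifier\n",
    "soft_clf = VotingClassifier(\n",
    "    estimators=[('lr', LogisticRegression(max_iter=1000, random_state=1)),\n",
    "                ('dt', DecisionTreeClassifier(random_state=1)),\n",
    "                ('gnb', GaussianNB())],\n",
    "    voting='soft',\n",
    "    weights=[2, 1, 1],  # arbitrary per-classifier weights for illustration\n",
    ")\n",
    "soft_clf.fit(Xa, ya)\n",
    "soft_clf.score(Xb, yb)"
   ]
  }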
 ],
 "metadata": {
  "jupytext": {
   "cell_metadata_filter": "-all",
   "main_language": "python",
   "notebook_metadata_filter": "-all"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
